
    A method for discovering and inferring appropriate eligibility criteria in clinical trial protocols without labeled data

    BACKGROUND: We consider the user task of designing clinical trial protocols and propose a method that discovers and outputs the most appropriate eligibility criteria from a potentially huge set of candidates. Each document d in our collection D is a clinical trial protocol which itself contains a set of eligibility criteria. Given a small set of sample documents D' ⊂ D that a user has initially identified as relevant, e.g., via a user query interface, our scoring method automatically suggests eligibility criteria from D by ranking them according to how appropriate they are to the clinical trial protocol currently being designed. Appropriateness is measured by the degree to which a criterion is consistent with the user-supplied sample documents D'. METHOD: We propose a novel three-step method called LDALR which views documents as a mixture of latent topics. First, we infer the latent topics in the sample documents using Latent Dirichlet Allocation (LDA). Next, we use logistic regression models to compute the probability that a given candidate criterion belongs to a particular topic. Lastly, we score each criterion by computing its expected value, the probability-weighted sum of the topic proportions inferred from the set of sample documents. Intuitively, the greater the probability that a candidate criterion belongs to the topics that are dominant in the samples, the higher its expected value or score. RESULTS: Our experiments show that LDALR is 8 and 9 times better (for inclusion and exclusion criteria, respectively) than randomly choosing from a set of candidates obtained from relevant documents. In user simulation experiments using LDALR, we were able to automatically construct eligibility criteria that are on average 75% and 70% similar (for inclusion and exclusion criteria, respectively) to the correct eligibility criteria. CONCLUSIONS: We have proposed LDALR, a practical method for discovering and inferring appropriate eligibility criteria in clinical trial protocols without labeled data. Results from our experiments suggest that LDALR models can be used to effectively find appropriate eligibility criteria from a large repository of clinical trial protocols.
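
    The following is a minimal sketch of the LDALR scoring idea described above, using scikit-learn-style components. It is not the authors' implementation: the pseudo-labelling of criteria by their dominant LDA topic, the vectoriser settings, and all names are assumptions made for illustration; the paper's exact training scheme for the logistic regression step may differ.

    ```python
    # Sketch of LDALR-style scoring (illustrative, not the authors' code).
    # Step 1: infer latent topics from the user-selected sample protocols with LDA.
    # Step 2: estimate P(topic | criterion) with a logistic regression classifier
    #         (trained here on LDA-derived pseudo-labels -- an assumption).
    # Step 3: score each candidate criterion by its expected value under the
    #         topic proportions of the sample documents.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression

    def score_criteria(sample_docs, candidate_criteria, n_topics=20):
        # Shared bag-of-words representation for documents and criteria.
        vec = CountVectorizer(stop_words="english")
        X_docs = vec.fit_transform(sample_docs)
        X_crit = vec.transform(candidate_criteria)

        # Step 1: LDA over the sample documents D'.
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        doc_topics = lda.fit_transform(X_docs)          # shape: (|D'|, n_topics)
        sample_proportions = doc_topics.mean(axis=0)    # aggregate topic mix of D'

        # Step 2: classifier mapping a criterion to a topic distribution.
        crit_topics = lda.transform(X_crit)
        pseudo_labels = crit_topics.argmax(axis=1)      # dominant topic as label
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_crit, pseudo_labels)
        p_topic_given_crit = clf.predict_proba(X_crit)  # (n_criteria, n_classes)

        # Step 3: expected value = probability-weighted sum of sample proportions.
        scores = p_topic_given_crit @ sample_proportions[clf.classes_]
        order = np.argsort(-scores)
        return [(candidate_criteria[i], float(scores[i])) for i in order]
    ```

    Criteria whose predicted topics coincide with the topics dominant in D' receive the highest scores, which is the intuition stated in the abstract.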

    Boosting Drug Named Entity Recognition using an Aggregate Classifier

    Objective: Drug named entity recognition (NER) is a critical step for complex biomedical NLP tasks such as the extraction of pharmacogenomic, pharmacodynamic and pharmacokinetic parameters. Large quantities of high-quality training data are almost always a prerequisite for employing supervised machine-learning techniques to achieve high classification performance. However, the human labour needed to produce and maintain such resources is a significant limitation. In this study, we improve the performance of drug NER without relying exclusively on manual annotations. Methods: We perform drug NER using either a small gold-standard corpus (120 abstracts) or no corpus at all. In our approach, we develop a voting system that combines a number of heterogeneous models, based on dictionary knowledge, gold-standard corpora and silver annotations, to enhance performance. To improve recall, we employed genetic programming to evolve 11 regular-expression patterns that capture common drug suffixes and used them as an extra means for recognition. Materials: Our approach uses a dictionary of drug names, i.e. DrugBank, a small manually annotated corpus, i.e. the pharmacokinetic corpus, and a part of the UKPMC database as raw biomedical text. Gold-standard and silver annotated data are used to train maximum entropy and multinomial logistic regression classifiers. Results: Aggregating drug NER methods based on gold-standard annotations, dictionary knowledge and patterns improved the performance of models trained only on gold-standard annotations, achieving a maximum F-score of 95%. In addition, combining models trained on silver annotations, dictionary knowledge and patterns is shown to achieve performance comparable to models trained exclusively on gold-standard data. The main reason appears to be the morphological similarities shared among drug names. Conclusion: We conclude that gold-standard data are not a hard requirement for drug NER. Combining heterogeneous models built on dictionary knowledge can achieve classification performance similar or comparable to that of the best-performing model trained on gold-standard annotations.
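
    A minimal sketch of the aggregation idea follows: token-level labels from heterogeneous recognisers (a dictionary lookup, suffix regexes, and optionally a trained tagger) are combined by voting. The suffix patterns, lexicon and example sentence are placeholders invented for illustration, not the paper's evolved patterns or its actual components.

    ```python
    # Illustrative sketch of combining heterogeneous drug-NER models by voting.
    # The recognisers below are stand-ins for the paper's components; a statistical
    # tagger trained on gold or silver annotations would be added the same way.
    import re
    from collections import Counter
    from typing import Callable, List

    # Hypothetical examples standing in for evolved drug-suffix patterns.
    DRUG_SUFFIX_PATTERNS = [
        re.compile(r"\w+(mab|vir|azole|cillin|prazole|statin)\b", re.IGNORECASE),
    ]

    def dictionary_recogniser(tokens: List[str], lexicon: set) -> List[str]:
        """Tag a token as DRUG if it appears in a drug-name lexicon (e.g. DrugBank)."""
        return ["DRUG" if t.lower() in lexicon else "O" for t in tokens]

    def pattern_recogniser(tokens: List[str]) -> List[str]:
        """Tag a token as DRUG if it matches one of the suffix patterns."""
        return ["DRUG" if any(p.fullmatch(t) for p in DRUG_SUFFIX_PATTERNS) else "O"
                for t in tokens]

    def aggregate(tokens: List[str],
                  recognisers: List[Callable[[List[str]], List[str]]]) -> List[str]:
        """Vote over the token-level labels produced by each recogniser.
        With an even number of voters, ties fall to the label seen first
        in the recognisers' order."""
        votes = [r(tokens) for r in recognisers]
        return [Counter(column).most_common(1)[0][0] for column in zip(*votes)]

    # Example usage with two rule-based recognisers.
    lexicon = {"paracetamol", "ibuprofen"}
    tokens = "Patients received ibuprofen and atorvastatin daily".split()
    labels = aggregate(tokens, [lambda t: dictionary_recogniser(t, lexicon),
                                pattern_recogniser])
    print(list(zip(tokens, labels)))
    ```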

    A behaviour biometrics dataset for user identification and authentication

    As e-Commerce continues to shift our shopping preference from the physical to the online marketplace, we leave behind digital traces of our personally identifiable details. For example, the merchant keeps a record of your name and address; the payment processor stores your transaction details, including account or card information; and every website you visit stores other information such as your device address and type. Cybercriminals constantly steal and use some of this information to commit identity fraud, ultimately leading to devastating consequences for the victims, but also for the card issuers and payment processors with whom the financial liability most often lies. We therefore recognise that data is generally compromised in this digital age, and personal data such as card numbers, passwords, personal identification numbers and account details can be easily stolen and used by someone else. However, there is a plethora of data relating to a person's behaviour biometrics that is almost impossible to steal, such as the way they type on a keyboard, how they move the cursor, or whether they normally do so via a mouse, touchpad or trackball. This data, commonly called keystroke, mouse and touchscreen dynamics, can be used to create a unique profile for the legitimate card owner, which can be utilised as an additional layer of user authentication during online card payments. Machine learning is a powerful technique for analysing such data to gain knowledge, and it has been used widely and successfully in many sectors for profiling, e.g., genome classification in molecular biology and genetics, where predictions are made for one or more forms of biochemical activity along the genome. Similar techniques are applicable in the financial sector to detect anomalies in user keyboard and mouse behaviour when entering card details online, such that they can be used to distinguish between a legitimate and an illegitimate card owner. In this article, a behaviour biometrics (i.e., keystroke and mouse dynamics) dataset, collected from 88 individuals, is presented. The dataset holds a total of 1760 instances categorised into two classes (i.e., legitimate and illegitimate card owners’ behaviour). The data was collected to facilitate an academic start-up project (called CyberSignature) which received funding from Innovate UK under the Cyber Security Academic Startup Accelerator Programme. The dataset could be helpful to researchers who apply machine learning to develop applications using keystroke and mouse dynamics, e.g., in cybersecurity to prevent identity theft. The dataset, entitled ‘Behaviour Biometrics Dataset’, is freely available on the Mendeley Data repository.
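
    Below is a minimal sketch of how such behaviour-biometrics data could feed a user-verification classifier. The feature columns, file name and label column are assumptions for demonstration only; they are not the schema of the published ‘Behaviour Biometrics Dataset’.

    ```python
    # Illustrative sketch: training a verification model on keystroke and mouse
    # dynamics features.  Column names and file layout are assumed, not taken
    # from the published dataset.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Hypothetical per-session aggregates of typing and cursor behaviour,
    # plus a binary label (1 = legitimate card owner, 0 = illegitimate).
    df = pd.read_csv("behaviour_biometrics.csv")
    feature_cols = ["mean_key_hold_time", "mean_flight_time",
                    "mean_mouse_speed", "mean_click_interval"]
    X, y = df[feature_cols], df["legitimate"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))
    ```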

    ASCOT: a text mining-based web-service for efficient search and assisted creation of clinical trials

    Clinical trials are mandatory protocols describing medical research on humans and are among the most valuable sources of medical practice evidence. Searching for trials relevant to a given query is laborious due to the immense number of existing protocols. Apart from search, writing new trials includes composing detailed eligibility criteria, which can be time-consuming, especially for new researchers. In this paper we present ASCOT, an efficient search application customised for clinical trials. ASCOT uses text mining and data mining methods to enrich clinical trials with metadata, which in turn serve as effective tools to narrow down search. In addition, ASCOT integrates a component for recommending eligibility criteria based on a set of selected protocols.
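
    As a small illustration of the "narrow down search" idea, the sketch below filters trial records by extracted metadata facets. The trial records, facet names and values are invented for demonstration and do not reflect ASCOT's actual data model.

    ```python
    # Minimal sketch of metadata-faceted filtering over clinical-trial search
    # results; records and facet names are illustrative only.
    from typing import Dict, Iterable, List

    def filter_by_facets(trials: Iterable[Dict], facets: Dict[str, str]) -> List[Dict]:
        """Keep only trials whose extracted metadata matches every requested facet."""
        return [t for t in trials
                if all(t.get("metadata", {}).get(k) == v for k, v in facets.items())]

    trials = [
        {"id": "NCT0001", "metadata": {"condition": "diabetes", "phase": "III"}},
        {"id": "NCT0002", "metadata": {"condition": "asthma", "phase": "II"}},
    ]
    print(filter_by_facets(trials, {"condition": "diabetes"}))  # -> only NCT0001
    ```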